Hotelling's T-squared distribution

In statistics, Hotelling's T-squared distribution is important because it arises as the distribution of a set of statistics which are natural generalisations of the statistics underlying Student's t distribution. In particular, the distribution arises in multivariate statistics when testing for differences between the (multivariate) means of different populations, where the corresponding univariate problem would call for a t-test. Up to a scaling factor, it coincides with the F-distribution.

The distribution is named for Harold Hotelling, who developed it[1] as a generalization of Student's t distribution.

The distribution

If a random variable X has Hotelling's T-squared distribution with parameters p and m, written

X \sim T^2_{p,m} ,

then[1]

\frac{m-p+1}{pm} X \sim F_{p,m-p+1} ,

where F_{p,m-p+1} is the F-distribution with parameters p and m-p+1.
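In practice, this relation lets one compute tail probabilities and critical values for T^2_{p,m} from any F-distribution routine. The following is a minimal sketch in Python, assuming SciPy is available; the helper name hotelling_t2_sf is illustrative, not a standard library function.

    from scipy.stats import f

    def hotelling_t2_sf(x, p, m):
        """Upper-tail probability P(X > x) for X ~ T^2_{p,m},
        using the relation (m-p+1)/(pm) * X ~ F_{p, m-p+1}."""
        return f.sf((m - p + 1) / (p * m) * x, p, m - p + 1)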

Hotelling's T-squared statistic

Hotelling's T-squared statistic is a generalization of Student's t statistic that is used in multivariate hypothesis testing, and is defined as follows.[1]

Let \mathcal{N}_p(\boldsymbol{\mu},{\mathbf \Sigma}) denote a p-variate normal distribution with location \boldsymbol{\mu} and covariance {\mathbf \Sigma}. Let

{\mathbf x}_1,\dots,{\mathbf x}_n\sim \mathcal{N}_p(\boldsymbol{\mu},{\mathbf \Sigma})

be n independent random vectors, each represented as a p\times1 column vector of real numbers. Define

\overline{\mathbf x}=\frac{\mathbf{x}_1+\cdots+\mathbf{x}_n}{n}

to be the sample mean. It can be shown that


n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf \Sigma}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu})\sim\chi^2_p ,

where \chi^2_p is the chi-squared distribution with p degrees of freedom. However, {\mathbf \Sigma} is often unknown, and we wish to test hypotheses about the location \boldsymbol{\mu}.

Define

{\mathbf W}=\frac{1}{n-1}\sum_{i=1}^n (\mathbf{x}_i-\overline{\mathbf x})(\mathbf{x}_i-\overline{\mathbf x})'

to be the sample covariance matrix. Here we denote transpose by an apostrophe. It can be shown that \mathbf W is positive-definite (with probability 1, provided n > p) and that (n-1){\mathbf W} follows a p-variate Wishart distribution with n-1 degrees of freedom.[2] Hotelling's T-squared statistic is then defined to be


t^2=n(\overline{\mathbf x}-\boldsymbol{\mu})'{\mathbf W}^{-1}(\overline{\mathbf x}-\boldsymbol{\mu})

because it can be shown that

t^2 \sim T^2_{p,n-1}

i.e.

\frac{n-p}{p(n-1)}t^2 \sim F_{p,n-p} ,

where F_{p,n-p} is the F-distribution with parameters p and n-p. To calculate a p-value, multiply the t^2 statistic by the constant above and refer the result to the F-distribution.
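As a worked illustration of the one-sample test just described, here is a minimal sketch in Python, assuming NumPy and SciPy; the function name hotelling_one_sample and its arguments are illustrative.

    import numpy as np
    from scipy.stats import f

    def hotelling_one_sample(X, mu0):
        """X: (n, p) data matrix; mu0: hypothesised mean vector.
        Returns Hotelling's t^2 and the corresponding p-value."""
        n, p = X.shape
        xbar = X.mean(axis=0)
        W = np.cov(X, rowvar=False)             # sample covariance, divisor n-1
        d = xbar - mu0
        t2 = n * d @ np.linalg.solve(W, d)      # n (xbar-mu0)' W^{-1} (xbar-mu0)
        F_stat = (n - p) / (p * (n - 1)) * t2   # ~ F_{p, n-p} under H0
        return t2, f.sf(F_stat, p, n - p)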

Hotelling's two-sample T-squared statistic

If {\mathbf x}_1,\dots,{\mathbf x}_{n_x}\sim \mathcal{N}_p(\boldsymbol{\mu},{\mathbf V}) and {\mathbf y}_1,\dots,{\mathbf y}_{n_y}\sim \mathcal{N}_p(\boldsymbol{\mu},{\mathbf V}), where the samples are drawn independently from two multivariate normal distributions with the same mean and covariance, and we define

\overline{\mathbf x}=\frac{1}{n_x}\sum_{i=1}^{n_x} \mathbf{x}_i \qquad \overline{\mathbf y}=\frac{1}{n_y}\sum_{i=1}^{n_y} \mathbf{y}_i

as the sample means, and

{\mathbf W}= \frac{\sum_{i=1}^{n_x}(\mathbf{x}_i-\overline{\mathbf x})(\mathbf{x}_i-\overline{\mathbf x})' + \sum_{i=1}^{n_y}(\mathbf{y}_i-\overline{\mathbf y})(\mathbf{y}_i-\overline{\mathbf y})'}{n_x+n_y-2}

as the unbiased pooled covariance matrix estimate, then Hotelling's two-sample T-squared statistic is

t^2 = \frac{n_x n_y}{n_x+n_y}(\overline{\mathbf x}-\overline{\mathbf y})'{\mathbf W}^{-1}(\overline{\mathbf x}-\overline{\mathbf y}) \sim T^2_{p,\,n_x+n_y-2}

and it can be related to the F-distribution by[2]

\frac{n_x+n_y-p-1}{(n_x+n_y-2)p}t^2 \sim F_{p,\,n_x+n_y-1-p} .

The non-null distribution of this statistic is the noncentral F-distribution (the ratio of a noncentral chi-squared random variable and an independent central chi-squared random variable, each divided by its degrees of freedom):

\frac{n_x+n_y-p-1}{(n_x+n_y-2)p}t^2 \sim F_{p,\,n_x+n_y-1-p}(\delta),

with

\delta = \frac{n_x n_y}{n_x+n_y}\boldsymbol{\nu}'\mathbf{V}^{-1}\boldsymbol{\nu},

where \boldsymbol{\nu} is the difference vector between the population means.
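The two-sample statistic can be computed analogously to the one-sample case. The following is a minimal sketch in Python, again assuming NumPy and SciPy, with an illustrative helper name.

    import numpy as np
    from scipy.stats import f

    def hotelling_two_sample(X, Y):
        """X: (n_x, p) and Y: (n_y, p) data matrices.
        Returns t^2 and the p-value, assuming equal covariance matrices."""
        nx, ny, p = X.shape[0], Y.shape[0], X.shape[1]
        d = X.mean(axis=0) - Y.mean(axis=0)
        # Pooled covariance with divisor n_x + n_y - 2, as defined above.
        W = ((nx - 1) * np.cov(X, rowvar=False)
             + (ny - 1) * np.cov(Y, rowvar=False)) / (nx + ny - 2)
        t2 = nx * ny / (nx + ny) * d @ np.linalg.solve(W, d)
        F_stat = (nx + ny - p - 1) / ((nx + ny - 2) * p) * t2
        return t2, f.sf(F_stat, p, nx + ny - p - 1)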

References

  1. Hotelling, H. (1931). "The generalization of Student's ratio". Annals of Mathematical Statistics 2 (3): 360–378. doi:10.1214/aoms/1177732979.
  2. Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate Analysis. Academic Press.
